Abstract

We describe the hyperparameter search problem in the field of machine learning and discuss 
its main challenges from an optimization perspective. Machine learning methods attempt to build 
models that capture some element of interest based on given data. Most common learning algorithms 
feature a set of hyperparameters that must be determined before training commences. The choice 
of hyperparameters can significantly affect the resulting model’s performance, but determining good 
values can be complex; hence a disciplined, theoretically sound search strategy is essential. 


1 Introduction 

Machine learning research focuses on the development of methods that are capable of capturing some 
element of interest from a given data set. Such elements include but are not limited to coherent structures 
within data (clustering) or the ability to predict certain target values based on given characteristics, which 
may be discrete (classification) or continuous (regression). 

A large variety of learning methods exist, ranging from biologically inspired neural networks [7] 
over kernel methods [29] to ensemble models [9, 11]. A common trait in these methods is that they are 
parameterized by a set of hyperparameters $\lambda$, which must be set appropriately by the user to 
maximize the usefulness of the learning approach. Hyperparameters are used to configure various aspects 
of the learning algorithm and can have wildly varying effects on the resulting model and its performance. 

Hyperparameter search is commonly performed manually, via rules-of-thumb [19, 20] or by testing 
sets of hyperparameters on a predefined grid [28]. These approaches leave much to be desired in terms of 
reproducibility and are impractical when the number of hyperparameters is large [10]. Due to these flaws, 
the idea of automating hyperparameter search is receiving increasing amounts of attention in machine 
learning, for instance via benchmarking suites [15] and various initiatives.* Automated approaches have 
already been shown to outperform manual search by experts on several problems [2, 5]. 
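
As an illustration of the grid-based approach mentioned above, the following is a minimal sketch in Python. The objective `toy_objective`, the hyperparameter names `C` and `sigma`, and the grid values are hypothetical stand-ins chosen for this example; in practice the objective would train and evaluate a model. Note that the number of grid points is the product of the number of values per hyperparameter, which is why grids become impractical as the number of hyperparameters grows.

```python
import itertools
import math


def grid_search(objective, grid):
    """Evaluate the objective on every combination of values in the grid."""
    names = list(grid)
    best_params, best_loss = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_objective(params):
    # Hypothetical loss surface (an assumption for illustration only),
    # smallest near C = 10 and sigma = 0.1 on a logarithmic scale.
    return ((math.log10(params["C"]) - 1.0) ** 2
            + (math.log10(params["sigma"]) + 1.0) ** 2)


grid = {"C": [0.1, 1.0, 10.0, 100.0], "sigma": [0.01, 0.1, 1.0]}
n_points = math.prod(len(values) for values in grid.values())
best_params, best_loss = grid_search(toy_objective, grid)
print(n_points, best_params, best_loss)
```

With k values for each of d hyperparameters, the grid contains k^d points, each of which requires one full objective function evaluation.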

We briefly introduce some key challenges inherent to hyperparameter search in Section 2. The 
combination of all these hurdles makes hyperparameter search a formidable optimization task. In Section 3 
we give a succinct overview of the current state-of-the-art in terms of algorithms and available software. 

1.1 Example: controlling model complexity 

A key balancing act in machine learning is choosing an appropriate level of model complexity: if the 
model is too complex, it will fit the data used to construct the model very well but generalize poorly 
to unseen data (overfitting); if the complexity is too low the model won’t capture all the information in 
the data (underfitting). This is often referred to as the bias-variance trade-off [12, 17], since a complex 
model exhibits large variance while an overly simple one is strongly biased. Most general-purpose 
methods feature hyperparameters to control this trade-off, for instance via regularization as in support 
vector machines and regularization networks [16, 18]. 
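
To make the role of a regularization hyperparameter concrete, here is a minimal sketch using one-dimensional ridge regression, which is a deliberately simple stand-in and not one of the methods cited above. The function names, the synthetic data and the values of `reg` are assumptions made for this example. Larger values of `reg` shrink the fitted weight towards zero (a simpler model), which can only increase the error on the training data.

```python
import random


def fit_ridge_1d(xs, ys, reg):
    """Closed-form minimizer of sum((y - w*x)^2) + reg * w^2 for a scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + reg)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for illustration): y = 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(20)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]

regs = [0.0, 0.1, 1.0, 10.0, 100.0]
weights = [fit_ridge_1d(xs, ys, reg) for reg in regs]
train_errors = [mse(w, xs, ys) for w in weights]
for reg, w, err in zip(regs, weights, train_errors):
    print(f"reg={reg:>6}: w={w:.3f}, train MSE={err:.4f}")
```

Whether a given amount of shrinkage helps or hurts on unseen data is exactly the question that hyperparameter search has to answer.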

1.2 Formalizing hyperparameter search 

The goal of many machine learning tasks can be summarized as training a model $\mathcal{M}$ which 
minimizes some predefined loss function $\mathcal{L}(X^{(te)}; \mathcal{M})$ on given test data 
$X^{(te)}$. Common loss functions include mean squared error and error rate. The model $\mathcal{M}$ is 
constructed by a learning algorithm $\mathcal{A}$ using a training set $X^{(tr)}$, typically involving 
solving some (convex) optimization problem. The learning algorithm $\mathcal{A}$ may itself be 
parameterized by a set of hyperparameters $\lambda$, e.g. $\mathcal{M} = \mathcal{A}(X^{(tr)}; \lambda)$. 
An example model $\mathcal{M}$ is a support vector machine classifier with Gaussian kernel [29], for which 
the training problem $\mathcal{A}$ is parameterized by the regularization constant $C$ and kernel 
bandwidth $\sigma$, i.e. $\lambda = [C, \sigma]$. 

*Such as http://www.automl.org/ and https://www.codalab.org/competitions/2321 

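The goal of hyperparameter search is to find a set of hyperparameters $\lambda^*$ that yield an optimal 
model $\mathcal{M}^*$ which minimizes $\mathcal{L}(X^{(te)}; \mathcal{M})$. This can be formalized as 
follows [10]: 

$$\lambda^* = \arg\min_{\lambda} \mathcal{L}\big(X^{(te)}; \mathcal{A}(X^{(tr)}; \lambda)\big) = \arg\min_{\lambda} \mathcal{F}(\lambda; \mathcal{A}, X^{(tr)}, X^{(te)}, \mathcal{L}). \qquad (1)$$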

The objective function $\mathcal{F}$ takes a tuple of hyperparameters $\lambda$ and returns the 
associated loss. The data sets $X^{(tr)}$ and $X^{(te)}$ are given and the learning algorithm 
$\mathcal{A}$ and loss function $\mathcal{L}$ are chosen. Depending on the learning task, $X^{(tr)}$ and 
$X^{(te)}$ may be labeled and/or equal to each other. In supervised learning, a data set is often split 
into $X^{(tr)}$ and $X^{(te)}$ using hold-out or cross-validation methods [14, 24]. 
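
The structure of the objective function can be sketched in code as follows. This is only an illustration under stated assumptions: `learn` is a hypothetical learning algorithm (one-dimensional ridge regression), `loss` is mean squared error, the data are synthetic, and the search over `candidates` is a plain exhaustive minimum rather than any particular method from the literature.

```python
import random


def learn(train, lam):
    """Hypothetical learning algorithm: 1-D ridge regression, returns weight w."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)


def loss(test, w):
    """Loss function: mean squared error of the model y = w * x on the test set."""
    return sum((y - w * x) ** 2 for x, y in test) / len(test)


def objective(lam, algorithm, train, test, loss_fn):
    """Train a model with hyperparameter lam and return its loss on the test set."""
    model = algorithm(train, lam)
    return loss_fn(test, model)


# Synthetic data set (an assumption), split into training and test parts (hold-out).
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
data = [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in xs]
train, test = data[:30], data[30:]

candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
losses = {lam: objective(lam, learn, train, test, loss) for lam in candidates}
best_lam = min(losses, key=losses.get)
print(best_lam, losses[best_lam])
```

Every call to `objective` involves a complete training run, which is the source of the cost discussed in Section 2.1.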


2 Challenges in hyperparameter search 

The characteristics of the search problem depend on the learning algorithm $\mathcal{A}$, the chosen loss 
function $\mathcal{L}$ and the data sets $X^{(tr)}$, $X^{(te)}$, as shown in Equation (1). Hyperparameter 
search is typically approached as a non-differentiable, single-objective optimization problem over a 
mixed-type, constrained domain. In this section we discuss the origins and consequences of challenges in 
hyperparameter search. 

2.1 Costly objective function evaluations 

Each objective function evaluation requires evaluating the performance of a model trained with 
hyperparameters $\lambda$. Depending on the available computational resources, the nature of the learning 
algorithm $\mathcal{A}$ and the size of the problem, each evaluation may take considerable time. Training 
times in the order of minutes are considered fast, since days and even weeks are not unheard of 
[13, 25, 31]. Evaluation time is exacerbated when procedures that train multiple models are employed, for 
instance to reliably estimate generalization performance [14, 24]. This leads to an increasing need for 
efficient methods to optimize hyperparameters that require a minimal number of objective function 
evaluations. 
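
When evaluations are this expensive, it is common to track and cap the number of evaluations a search method is allowed to spend. The wrapper below is a hypothetical sketch (the class `BudgetedObjective` and the stand-in `slow_objective` are not taken from any package). It caches results for repeated hyperparameter tuples, which is only appropriate if the objective is deterministic.

```python
class BudgetedObjective:
    """Wrap an objective function with a result cache and an evaluation budget."""

    def __init__(self, objective, max_evals):
        self.objective = objective
        self.max_evals = max_evals
        self.cache = {}

    @property
    def n_evals(self):
        """Number of distinct hyperparameter tuples evaluated so far."""
        return len(self.cache)

    def __call__(self, *params):
        if params in self.cache:
            # Repeated tuple: reuse the stored loss instead of retraining.
            return self.cache[params]
        if self.n_evals >= self.max_evals:
            raise RuntimeError("evaluation budget exhausted")
        value = self.objective(*params)
        self.cache[params] = value
        return value


def slow_objective(c, sigma):
    # Stand-in for an expensive train-and-evaluate step (an assumption).
    return (c - 1.0) ** 2 + sigma ** 2


f = BudgetedObjective(slow_objective, max_evals=2)
first = f(1.0, 0.5)
again = f(1.0, 0.5)   # served from the cache, does not count against the budget
second = f(2.0, 0.5)
print(f.n_evals, first, second)
```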

Additionally, the time required to train and test models can be contingent upon the choice of 
hyperparameters. Some hyperparameters have an obvious influence on train and/or test time, e.g. the 
architecture of neural networks [7] and size of ensembles [9, 11]. The influence of hyperparameters can 
also be subtle, for instance regularization and kernel complexity can significantly affect training time 
for support vector machines [8]. 

2.2 Randomness 

The objective function often exhibits a stochastic component, which can be induced by various parts 
of the machine learning pipeline, for example due to inherent randomness of the learning algorithm 
(initialization of a neural network, resampling in ensemble approaches, ...) or due to finite sample effects 
in estimating generalization performance. This stochasticity can sometimes be addressed via machine 
learning techniques, but such solutions typically increase the time required per objective function 
evaluation dramatically, limiting their usefulness in some settings. 
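
The trade-off between noise and cost can be illustrated with a small sketch. Everything here is an assumption made for the example: `noisy_objective` is a hypothetical objective with additive Gaussian noise, and averaging over repeated evaluations is just one simple way to reduce the noise. Averaging over `repeats` runs reduces the spread of the estimate, but multiplies the cost of each evaluation by the same factor.

```python
import random
import statistics


def noisy_objective(lam, rng):
    """Hypothetical stochastic objective: true loss (lam - 1)^2 plus noise."""
    return (lam - 1.0) ** 2 + rng.gauss(0.0, 0.5)


def averaged_objective(lam, rng, repeats):
    """Average several noisy evaluations; costs `repeats` times as much."""
    return statistics.mean([noisy_objective(lam, rng) for _ in range(repeats)])


rng = random.Random(42)
single = [noisy_objective(1.0, rng) for _ in range(200)]
averaged = [averaged_objective(1.0, rng, repeats=25) for _ in range(200)]

spread_single = statistics.stdev(single)
spread_averaged = statistics.stdev(averaged)
print(f"single: {spread_single:.3f}, averaged over 25 runs: {spread_averaged:.3f}")
```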

This inherent stochasticity directly implies that the empirical best hyperparameter tuple, obtained 
after a given set of evaluations, is not necessarily the true optimum of interest $\lambda^*$. Fortunately, 
many search methods are designed to probe many tuples close to the empirical best. If the search region 
surrounding the empirical optimum is densely sampled, we can determine whether the empirical best was 
an outlier or not in a post-processing phase, for instance by assuming Lipschitz continuity or smoothness. 
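
One simple way to carry out such a post-processing step, assuming the objective is smooth, is to compare each evaluated tuple with the average loss of its neighbours. The sketch below is an assumption made for illustration rather than a method from the cited literature; the function `smoothed_losses`, the radius and the evaluation data are all hypothetical.

```python
def smoothed_losses(evaluations, radius):
    """Replace each loss by the mean loss of all points within `radius`."""
    smoothed = []
    for lam, _ in evaluations:
        nearby = [loss for other, loss in evaluations if abs(other - lam) <= radius]
        smoothed.append((lam, sum(nearby) / len(nearby)))
    return smoothed


# Hypothetical evaluations of a smooth loss (lam - 1)^2 on a regular grid ...
evaluations = [(i / 10, (i / 10 - 1.0) ** 2) for i in range(21)]
# ... plus one spuriously low, noisy measurement far from the true optimum.
evaluations.append((0.05, -1.0))

raw_best = min(evaluations, key=lambda pair: pair[1])
smooth_best = min(smoothed_losses(evaluations, radius=0.15), key=lambda pair: pair[1])
print("raw best:", raw_best[0], "smoothed best:", smooth_best[0])
```

The raw minimum is the outlier, whereas the smoothed minimum lies in the densely sampled region around the true optimum.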

2.3 Complex search spaces 

The number of hyperparameters is usually small (< 5), but it can range up to hundreds for complex 
learning algorithms [4] or when preprocessing steps are also subjected to optimization [22]. It has been 
demonstrated empirically that in many cases only a handful of hyperparameters significantly impact 
performance, though identifying the relevant ones in advance is difficult [2]. 

Hyperparameters are usually of continuous or integer type, leading to mixed-type optimization 
problems. Continuous hyperparameters are commonly related to regularization. Common integer 
hyperparameters are related to network architecture for neural networks [7], size of ensembles [9, 11] or 
the parameterization of kernels in kernel methods [29]. 

Some tasks feature highly complex search spaces, in which the very existence of certain 
hyperparameters is conditional upon the value of others [3, 5, 22]. A simple example is optimizing the 
architecture of neural networks [7], where the number of hidden layers is one hyperparameter and the size 
of each layer induces a set of additional hyperparameters, conditional upon the number of layers. 
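
The neural network example can be sketched as a sampling procedure in which the number of layer-size hyperparameters depends on a previously drawn value. The function name, the bound `max_layers` and the candidate sizes in `size_choices` are assumptions made for this illustration.

```python
import random


def sample_architecture(rng, max_layers=4, size_choices=(16, 32, 64, 128)):
    """Draw a configuration from a conditional search space."""
    # Top-level integer hyperparameter: the number of hidden layers.
    n_layers = rng.randint(1, max_layers)
    # Conditional hyperparameters: one size per layer, so how many of them
    # exist depends on the value of n_layers drawn above.
    layer_sizes = [rng.choice(size_choices) for _ in range(n_layers)]
    return {"n_layers": n_layers, "layer_sizes": layer_sizes}


rng = random.Random(0)
configs = [sample_architecture(rng) for _ in range(5)]
for config in configs:
    print(config)
```

Because configurations differ in length, search methods that assume a fixed-dimensional vector of hyperparameters cannot be applied to such spaces without modification.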

3 Current approaches 

A wide variety of optimization methods have been used for hyperparameter search, including particle 
swarm optimization [26, 27], genetic algorithms [32], coupled simulated annealing [33] and racing 
algorithms [6]. Surprisingly, randomly sampling the search space was only established recently as a 
baseline for comparison of optimization methods [2]. Bayesian and related sequential model-based 
optimization techniques using variants of the expected improvement criterion [23] are receiving a lot of 
attention currently [1, 5, 15, 21, 30], owing to their efficiency in terms of objective function evaluations. 
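
Two of the ideas above can be sketched briefly. The first function is a random search baseline. The second computes the standard closed-form expected improvement for minimization, under the assumption that a surrogate model predicts a Gaussian distribution with mean `mu` and standard deviation `sigma` for the loss at a candidate point. The function names, the toy objective and the sampling range are hypothetical, and the variants used in the cited works may differ in detail.

```python
import math
import random


def random_search(objective, sample, n_evals, rng):
    """Baseline: evaluate n_evals randomly drawn candidates, keep the best."""
    best_params, best_loss = None, float("inf")
    for _ in range(n_evals):
        params = sample(rng)
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def expected_improvement(mu, sigma, best_loss):
    """Expected amount by which a Gaussian prediction improves on best_loss."""
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(best_loss - mu, 0.0)
    z = (best_loss - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return (best_loss - mu) * cdf + sigma * pdf


rng = random.Random(0)
best_params, best_loss = random_search(
    objective=lambda lam: (lam - 1.0) ** 2,     # toy objective (an assumption)
    sample=lambda r: r.uniform(0.0, 2.0),       # uniform sampling of the domain
    n_evals=50,
    rng=rng,
)
print(best_params, best_loss)
print(expected_improvement(mu=best_loss, sigma=1.0, best_loss=best_loss))
```

A sequential model-based method would evaluate next the candidate with the largest expected improvement, instead of drawing it at random.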

Software packages are being released which implement various dedicated optimization methods for 
hyperparameter search. Such packages are usually intended to be used in synergy with machine learning 
libraries that provide learning algorithms [28]. Most of these packages focus on Bayesian methods [3, 
22, 30], though metaheuristic optimization approaches are also offered [10]. The increased development 
of such packages testifies to the growing interest in automated hyperparameter search. 

4 Conclusion 

A fully automated, self-configuring learning strategy can be considered the holy grail of machine 
learning. Though the current state-of-the-art still has a long way to go before this goal can be reached, it 
is evident that hyperparameter search is a crucial element in its pursuit. Automated hyperparameter search 
is a hot topic within the machine learning community which we believe can benefit greatly from the 
techniques and lessons learnt in metaheuristic optimization. 